Deep Learning I
2. Preliminaries
WU Xiaokun 吴晓堃
xkun.wu [at] gmail
2021/02/28
Also called tensors
NumPy's ndarray: supports CPU computation only
Tensor in PyTorch and TensorFlow: supports automatic differentiation
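A minimal sketch of the difference, assuming NumPy and PyTorch are installed; the data and the toy computation are illustrative only:

```python
import numpy as np
import torch

a = np.array([1.0, 2.7, 3.4])         # NumPy ndarray: CPU computation only
t = torch.tensor([1.0, 2.7, 3.4],
                 requires_grad=True)  # PyTorch Tensor: tracks gradients

y = (t ** 2).sum()                    # build a computation on the tensor
y.backward()                          # automatic differentiation
print(t.grad)                         # dy/dt = 2t -> tensor([2.0, 5.4, 6.8])
```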
Difficulties:
0-d (scalar)
1.0
1-d (vector)
[1.0, 2.7, 3.4]
2-d (matrix)
[
[1.0, 2.7, 3.4]
[5.0, 0.2, 4.6]
[4.3, 8.5, 0.2]
]
3-d
[
[
[0.1, 2.7, 3.4]
[5.0, 0.2, 4.6]
[4.3, 8.5, 0.2]
]
[
[3.2, 5.7, 3.4]
[5.4, 6.2, 3.2]
[4.1, 3.5, 6.2]
]
]
4-d
[[[[. . .
. . .
. . .]]]]
5-d
[[[[[. . .
. . .
. . .]]]]]
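As a sketch, the examples above can be written directly as NumPy arrays; `ndim` and `shape` report the number of dimensions and the size along each:

```python
import numpy as np

x0 = np.array(1.0)                                   # 0-d: scalar
x1 = np.array([1.0, 2.7, 3.4])                       # 1-d: vector
x2 = np.array([[1.0, 2.7, 3.4],
               [5.0, 0.2, 4.6],
               [4.3, 8.5, 0.2]])                     # 2-d: matrix
x3 = np.array([[[0.1, 2.7, 3.4],
                [5.0, 0.2, 4.6],
                [4.3, 8.5, 0.2]],
               [[3.2, 5.7, 3.4],
                [5.4, 6.2, 3.2],
                [4.1, 3.5, 6.2]]])                   # 3-d tensor

for x in (x0, x1, x2, x3):
    print(x.ndim, x.shape)   # 0 (), 1 (3,), 2 (3, 3), 3 (2, 3, 3)
```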
Pandas (Python Data Analysis Library)
Two interpretations
Vector addition: connect the start point to the end point
Vector inner product: measures similarity
Row and column dimensions must match
Note: in Python, * is element-wise multiplication by default
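A minimal sketch of the distinction (NumPy assumed): `*` multiplies element by element, while `@` performs matrix multiplication:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

print(A * B)   # element-wise: [[ 5. 12.] [21. 32.]]
print(A @ B)   # matrix product: [[19. 22.] [43. 50.]]
```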
Matrix multiplication: a linear deformation of space
\begin{bmatrix} m_{00} & m_{01} & t_x \\ m_{10} & m_{11} & t_y \\ 0 & 0 & 1 \end{bmatrix}
Directions not changed by this matrix (eigenvectors):
Ax = \lambda x
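A sketch of finding such directions numerically with `np.linalg.eig`; the example matrix is an arbitrary choice:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)        # [2. 3.]
print(eigvecs[:, 0])  # eigenvector for lambda=2: [1. 0.]
# Verify A x = lambda x for the first eigenvector:
print(np.allclose(A @ eigvecs[:, 0], eigvals[0] * eigvecs[:, 0]))  # True
```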
Univariate differentiation (scalar w.r.t. scalar)
| y | a | x^n | \exp(x) | \log(x) | \sin(x) |
|---|---|---|---|---|---|
| \frac{dy}{dx} | 0 | nx^{n-1} | \exp(x) | \frac{1}{x} | \cos(x) |
| y | u+v | uv | y=f(u), u=g(x) |
|---|---|---|---|
| \frac{dy}{dx} | \frac{du}{dx} + \frac{dv}{dx} | \frac{du}{dx}v + \frac{dv}{dx}u | \frac{dy}{du}\frac{du}{dx} |
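These rules can be checked numerically; the sketch below verifies the chain rule for an illustrative pair f = exp, g = sin using a central finite difference:

```python
import math

def num_deriv(f, x, eps=1e-6):
    # central finite difference approximation of f'(x)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Chain rule: y = f(u), u = g(x) with f = exp, g = sin
# => dy/dx = exp(sin x) * cos x
f = lambda x: math.exp(math.sin(x))
x = 0.7
analytic = math.exp(math.sin(x)) * math.cos(x)
print(abs(num_deriv(f, x) - analytic) < 1e-6)   # True
```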
Multivariate differentiation (scalar w.r.t. vector)
df = \sum_i \frac{\partial f}{\partial x_i} dx_i = \frac{\partial f}{\partial x} d\mathbf{x}
Differential: d\mathbf{x} = [dx_1, dx_2, \ldots, dx_n]^T
(Partial) derivative: \frac{\partial}{\partial \mathbf{x}} = [\frac{\partial}{\partial x_1}, \frac{\partial}{\partial x_2}, \ldots, \frac{\partial}{\partial x_n}]
Multivariate differentiation (vector w.r.t. vector)
\mathbf{x} = \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n\\ \end{bmatrix}, \mathbf{y} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m\\ \end{bmatrix}
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial \mathbf{x}}\\ \frac{\partial y_2}{\partial \mathbf{x}}\\ \vdots\\ \frac{\partial y_m}{\partial \mathbf{x}}\\ \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \ldots & \frac{\partial y_1}{\partial x_n}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \ldots & \frac{\partial y_2}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \ldots & \frac{\partial y_m}{\partial x_n}\\ \end{bmatrix}
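PyTorch can compute this matrix directly via `torch.autograd.functional.jacobian`; the function below is an illustrative choice:

```python
import torch
from torch.autograd.functional import jacobian

def f(x):                       # f: R^3 -> R^2
    return torch.stack([x[0] * x[1], x[1] + x[2]])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)              # rows index y, columns index x
print(J)                        # shape (2, 3): [[2., 1., 0.], [0., 1., 1.]]
```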
Multivariate differentiation (vector w.r.t. scalar)
\mathbf{y} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m\\ \end{bmatrix}, \frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x}\\ \frac{\partial y_2}{\partial x}\\ \vdots\\ \frac{\partial y_m}{\partial x}\\ \end{bmatrix}
| | x:(1,) | \mathbf{x}:(n,1) |
|---|---|---|
| y:(1,) | \frac{\partial y}{\partial x}:(1,) | \frac{\partial y}{\partial \mathbf{x}}:(1,n) |
| \mathbf{y}:(m,1) | \frac{\partial \mathbf{y}}{\partial x}:(m,1) | \frac{\partial \mathbf{y}}{\partial \mathbf{x}}:(m,n) |
Clever mathematicians are very good at extracting and summarizing general patterns
| | x:(1,) | \mathbf{x}:(n,1) | \mathbf{X}:(n,k) |
|---|---|---|---|
| y:(1,) | \frac{\partial y}{\partial x}:(1,) | \frac{\partial y}{\partial \mathbf{x}}:(1,n) | \frac{\partial y}{\partial \mathbf{X}}:(k,n) |
| \mathbf{y}:(m,1) | \frac{\partial \mathbf{y}}{\partial x}:(m,1) | \frac{\partial \mathbf{y}}{\partial \mathbf{x}}:(m,n) | \frac{\partial \mathbf{y}}{\partial \mathbf{X}}:(m,k,n) |
| \mathbf{Y}:(m,l) | \frac{\partial \mathbf{Y}}{\partial x}:(m,l) | \frac{\partial \mathbf{Y}}{\partial \mathbf{x}}:(m,l,n) | \frac{\partial \mathbf{Y}}{\partial \mathbf{X}}:(m,l,k,n) |
Think: a matrix is a 2-d tensor; how does this extend to N-d tensors?
y=f(u), u=g(x) \Rightarrow \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}
Worked example by hand:
z = (\langle \mathbf{x}, \mathbf{w} \rangle - y)^2, \frac{\partial z}{\partial \mathbf{w}} = ?
\begin{aligned} a &= \langle \mathbf{x}, \mathbf{w} \rangle\\ b &= a - y\\ z &= b^2\\ \end{aligned}
\begin{aligned} \frac{\partial z}{\partial \mathbf{w}} &= \frac{\partial z}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial \mathbf{w}}\\ &= 2b \cdot 1 \cdot \mathbf{x}^T\\ &= 2 (\langle \mathbf{x}, \mathbf{w} \rangle - y) \mathbf{x}^T\\ \end{aligned}
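The hand computation can be verified with PyTorch autograd; the values of x, w, and y below are arbitrary:

```python
import torch

x = torch.tensor([1.0, 2.0])
w = torch.tensor([0.5, -0.5], requires_grad=True)
y = torch.tensor(1.0)

z = (torch.dot(x, w) - y) ** 2
z.backward()

print(w.grad)                         # autograd result: tensor([-3., -6.])
print(2 * (torch.dot(x, w) - y) * x)  # 2(<x,w> - y) x, same values
```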
Building the computational graph: decompose the computation into a directed acyclic graph (DAG) of operators
z = (\langle \mathbf{x}, \mathbf{w} \rangle - y)^2, \frac{\partial z}{\partial \mathbf{w}} = ?
\begin{aligned} a &= \langle \mathbf{x}, \mathbf{w} \rangle\\ b &= a - y\\ z &= b^2\\ \end{aligned}
Chain rule:
\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u_n} \frac{\partial u_n}{\partial u_{n-1}} \cdots \frac{\partial u_2}{\partial u_1} \frac{\partial u_1}{\partial x}
Forward accumulation (grouped from the input side):
\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u_n} \left(\frac{\partial u_n}{\partial u_{n-1}} \left(\cdots \left(\frac{\partial u_2}{\partial u_1} \frac{\partial u_1}{\partial x}\right)\right)\right)
Reverse accumulation (grouped from the output side):
\frac{\partial y}{\partial x} = \left(\left(\left(\frac{\partial y}{\partial u_n} \frac{\partial u_n}{\partial u_{n-1}}\right) \cdots\right) \frac{\partial u_2}{\partial u_1}\right) \frac{\partial u_1}{\partial x}
Forward pass: execute the computational graph and store intermediate results
\begin{aligned} a &= \langle \mathbf{x}, \mathbf{w} \rangle\\ b &= a - y\\ z &= b^2\\ \end{aligned}
Backward pass: compute gradient values, pruning unneeded branches to avoid redundant computation.
\begin{aligned} z &= b^2\\ \end{aligned}
\begin{aligned} \frac{\partial z}{\partial b} &= 2b\\ \end{aligned}
Backward pass: compute gradient values, pruning unneeded branches to avoid redundant computation.
\begin{aligned} b &= a - y\\ \end{aligned}
\begin{aligned} \frac{\partial b}{\partial a} &= 1\\ \end{aligned}
Backward pass: compute gradient values, pruning unneeded branches to avoid redundant computation.
\begin{aligned} a &= \langle \mathbf{x}, \mathbf{w} \rangle\\ \end{aligned}
\begin{aligned} \frac{\partial a}{\partial \mathbf{w}} &= \mathbf{x}^T\\ \end{aligned}
Backward pass: compute gradient values, pruning unneeded branches to avoid redundant computation.
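Putting the steps together, a plain-NumPy sketch of the forward pass (storing intermediates) and the backward pass (multiplying local gradients from output to input); the inputs are arbitrary:

```python
import numpy as np

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
y = 1.0

# Forward: execute the graph, storing intermediate results
a = x @ w            # a = <x, w>
b = a - y
z = b ** 2

# Backward: multiply local gradients from output to input
dz_db = 2 * b        # dz/db = 2b
db_da = 1.0          # db/da = 1
da_dw = x            # da/dw = x^T
dz_dw = dz_db * db_da * da_dw
print(dz_dw)         # [-3. -6.], matching the hand computation
```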
\begin{aligned} \textbf{h} &= \sigma(\textbf{W}_h \textbf{x} + \textbf{b}_h)\\ \textbf{o} &= \textbf{W}_o \textbf{h} + \textbf{b}_o\\ s &= \frac{\lambda}{2} \left( \lVert \textbf{W}_h \rVert_F^2 + \lVert \textbf{W}_o \rVert_F^2 \right)\\ \end{aligned}
\begin{aligned} \textbf{z} &= \textbf{W}_h \textbf{x}\\ \textbf{h} &= \phi(\textbf{z})\\ \textbf{o} &= \textbf{W}_o \textbf{h}\\ \mathcal{L} &= l(\textbf{o},y)\\ \end{aligned}
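A minimal sketch of this two-layer network in PyTorch; the layer sizes, the choice of ReLU for \phi, and squared error for l are assumptions for illustration:

```python
import torch

n_in, n_hidden, n_out = 4, 8, 2
x = torch.randn(n_in)
W_h = torch.randn(n_hidden, n_in, requires_grad=True)
W_o = torch.randn(n_out, n_hidden, requires_grad=True)

z = W_h @ x                   # z = W_h x
h = torch.relu(z)             # h = phi(z), taking phi = ReLU
o = W_o @ h                   # o = W_o h
y = torch.zeros(n_out)        # a dummy target
L = ((o - y) ** 2).mean()     # l(o, y): squared error as an example

L.backward()                  # backpropagation through the whole graph
print(W_h.grad.shape, W_o.grad.shape)   # torch.Size([8, 4]) torch.Size([2, 8])
```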
Empirical risk is the average loss over the training dataset; risk is the expected loss over the entire data population.
The objective function of a deep learning model typically has many local optima.
f(x) = x \cdot \cos(\pi x), \quad -1.0 \le x \le 2.0
Taylor expansion: a first-order approximation of a function
f(x + \epsilon) = f(x) + \epsilon f'(x) + O(\epsilon^2)
Gradient: the direction in which the function value increases fastest
Newton's method: second-order expansion; requires computing second derivatives (the Hessian)
Learning rate: \eta
Note: in flat regions of the landscape, the step size naturally becomes smaller
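A minimal gradient descent sketch on the example function f(x) = x cos(πx) above; the starting point and learning rate are arbitrary choices:

```python
import math

# f(x) = x cos(pi x)  =>  f'(x) = cos(pi x) - pi x sin(pi x)
f_prime = lambda x: math.cos(math.pi * x) - math.pi * x * math.sin(math.pi * x)

x, eta = 0.4, 0.05            # starting point and learning rate (assumptions)
for _ in range(50):
    x -= eta * f_prime(x)      # x <- x - eta * f'(x)
print(round(x, 4))             # converges to a nearby local minimum
```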
Escaping saddle points:
Escaping multiple local minima:
Tensors and tensor operations.
Multivariate differentiation and automatic differentiation.
Chain rule: forward accumulation and backpropagation.
Gradient-based optimization.
Labs: 2-D affine transformations; basic activation functions; constructing general functions by ReLU composition.
Key points: tensors, tensor operations, the three basic activation functions, the general form of layer-wise computation in deep learning.
Difficult points: the geometric interpretation of tensor operations, the backpropagation algorithm, gradient-based optimization, ReLU composition, universal approximation theory.
(*) List five basic 2-D affine transformations and apply them to warp an image.
(*) Briefly explain the geometric interpretation of deep learning.
(*) Use ReLU composition to fit the piecewise-linear function through (-10,-10), (0,0), (5,-5), (10,20).
(*) Suppose \mathbf{Y} is a tensor of shape (a,b,c,d) and \mathbf{X} is a tensor of shape (h,i,j,k,l); what is the shape of \frac{\partial \mathbf{Y}}{\partial \mathbf{X}}?
Draw the computational graph of z = (\langle \mathbf{x}, \mathbf{w} \rangle - y)^2, and compute \frac{\partial z}{\partial \mathbf{w}} using forward accumulation and reverse propagation.
Briefly describe the gradient descent algorithm.
Containers for data. 0-D, 1-D, and 2-D tensors are also called scalars, vectors, and matrices, respectively.
Functions that transform between different representations of the data.
Jacobian matrix (vector w.r.t. vector)
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial \mathbf{x}}\\ \frac{\partial y_2}{\partial \mathbf{x}}\\ \vdots\\ \frac{\partial y_m}{\partial \mathbf{x}}\\ \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \ldots & \frac{\partial y_1}{\partial x_n}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \ldots & \frac{\partial y_2}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \ldots & \frac{\partial y_m}{\partial x_n}\\ \end{bmatrix}
Matrix differentiation:
| | x:(1,) | \mathbf{x}:(n,1) | \mathbf{X}:(n,k) |
|---|---|---|---|
| y:(1,) | \frac{\partial y}{\partial x}:(1,) | \frac{\partial y}{\partial \mathbf{x}}:(1,n) | \frac{\partial y}{\partial \mathbf{X}}:(k,n) |
| \mathbf{y}:(m,1) | \frac{\partial \mathbf{y}}{\partial x}:(m,1) | \frac{\partial \mathbf{y}}{\partial \mathbf{x}}:(m,n) | \frac{\partial \mathbf{y}}{\partial \mathbf{X}}:(m,k,n) |
| \mathbf{Y}:(m,l) | \frac{\partial \mathbf{Y}}{\partial x}:(m,l) | \frac{\partial \mathbf{Y}}{\partial \mathbf{x}}:(m,l,n) | \frac{\partial \mathbf{Y}}{\partial \mathbf{X}}:(m,l,k,n) |
Applying the chain rule to compute the gradients of a neural network yields the backpropagation algorithm.
Objective: \text{arg min}_{W, b} J,
with: J = \lVert \overline{y} - y \rVert, \overline{y} = \sum \textcolor{blue}{relu}(\textbf{W} \textcolor{blue}{*} x + \textbf{b})
W_1 = W_0 - \nabla J * s
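A sketch of this training loop in PyTorch using the model \overline{y} = \sum relu(\textbf{W} x + \textbf{b}) above; the number of units, the toy 1-d dataset, and the step size s are assumptions:

```python
import torch

n_units = 8                                 # number of ReLU units (assumption)
W = torch.randn(n_units, 1, requires_grad=True)
b = torch.randn(n_units, requires_grad=True)
s = 0.01                                    # step size

x = torch.linspace(-1, 1, 32)               # toy 1-d inputs
y = x.abs()                                 # toy targets

for step in range(200):
    y_bar = torch.relu(W * x + b.unsqueeze(1)).sum(0)  # y_bar = sum relu(Wx+b)
    J = torch.norm(y_bar - y)                          # J = ||y_bar - y||
    J.backward()
    with torch.no_grad():
        W -= s * W.grad                                # W_1 = W_0 - grad(J) * s
        b -= s * b.grad
        W.grad.zero_()
        b.grad.zero_()
print(float(J))                                        # loss after training
```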
\underline{GT} = \textbf{W} \textcolor{blue}{*} \underline{input} + \textbf{b}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}, \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix}, \begin{bmatrix} \cos(\theta) & \sin(\theta) & 0 \\ -\sin(\theta) & \cos(\theta) & 0 \\ 0 & 0 & 1 \end{bmatrix}, \begin{bmatrix} 1 & e_x & 0 \\ e_y & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\underline{output} = \textcolor{blue}{activate}(\underline{GT}), \quad \underline{GT} = \textbf{W} \textcolor{blue}{*} \underline{input} + \textbf{b}
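A sketch applying some of the matrices above (translation, scaling, rotation) to a 2-D point in homogeneous coordinates; the parameter values are arbitrary:

```python
import numpy as np

theta, tx, ty, sx, sy = np.pi / 2, 1.0, 2.0, 2.0, 0.5

translate = np.array([[1, 0, tx],
                      [0, 1, ty],
                      [0, 0, 1]], dtype=float)
scale = np.array([[sx, 0, 0],
                  [0, sy, 0],
                  [0, 0, 1]], dtype=float)
rotate = np.array([[np.cos(theta), np.sin(theta), 0],
                   [-np.sin(theta), np.cos(theta), 0],
                   [0, 0, 1]], dtype=float)

p = np.array([1.0, 0.0, 1.0])        # point (1, 0) in homogeneous coordinates
print(translate @ p)                 # [2. 2. 1.]
print(scale @ p)                     # [2. 0. 1.]
print(np.round(rotate @ p))          # [ 0. -1.  1.] with this convention
```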
First study low-dimensional spaces to discover patterns, then generalize those patterns to high dimensions.
With enough parameters, a model can capture all the mapping relations in the raw data. Imagine the "\Omega path".
ReLU, sigmoid, tanh.
They provide nonlinearity.
\sum \textcolor{blue}{relu}(\textbf{W} \textcolor{blue}{*} \underline{input} + \textbf{b})
In approximation theory, both shallow and deep networks are known to approximate any continuous function at an exponential cost.
Construction method:
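A sketch of one standard construction, assuming the goal is a continuous piecewise-linear function through given knots: write f(x) = y_0 + \sum_i c_i \, relu(x - x_i), where each c_i is the change in slope at knot x_i. The knots below are taken from the exercise above:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Knots from the exercise: (-10,-10), (0,0), (5,-5), (10,20)
knots = [(-10, -10), (0, 0), (5, -5), (10, 20)]
xs = np.array([k[0] for k in knots], dtype=float)
ys = np.array([k[1] for k in knots], dtype=float)

slopes = np.diff(ys) / np.diff(xs)      # per-segment slopes: 1, -1, 5
coeffs = np.diff(slopes, prepend=0.0)   # slope changes at knots: 1, -2, 6

def f(x):
    # f(x) = y_0 + sum_i c_i * relu(x - x_i), valid for x >= xs[0]
    x = np.asarray(x, dtype=float)
    return ys[0] + sum(c * relu(x - x0) for c, x0 in zip(coeffs, xs[:-1]))

print(f(xs))   # [-10.   0.  -5.  20.]: passes through all knots
```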